Addressing Missing Data

Step 1: Calculate missing data for each variable and sort by missing percentage

In this step, the proportion of missing data is calculated for each variable in the cleaned dataframe. Based on these percentages, a subset is created that includes only those variables with 80% or more of the data present. This filters out the variables with very large amounts of missing data that could bias our regression models.
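The table below can be produced with a summarise-then-pivot pattern along these lines (a sketch, assuming `data` is the cleaned dataframe and the tidyverse is loaded):

```r
library(dplyr)
library(tidyr)

# Proportion of missing values for each variable, sorted ascending
missingness_df <- data |>
  summarise(across(everything(), \(x) mean(is.na(x)))) |>
  pivot_longer(everything(),
               names_to = "Variable",
               values_to = "Missing_Percentage") |>
  arrange(Missing_Percentage)

# Keep only variables with at least 80% of values present
low_missingness <- missingness_df |>
  filter(Missing_Percentage < 0.20)
```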

Subset of Variables with < 20% Missingness

Variable Missing_Percentage
sex_at_birth 0.0000000
state 0.0000000
eval_type 0.0019731
has_diabetes 0.0022708
good_health 0.0029124
stroke 0.0034016
asthma_ever 0.0039255
kidney_disease 0.0043663
high_bp 0.0044286
bronchitis 0.0047678
education 0.0053655
cancer 0.0053932
arthritis 0.0059078
depression 0.0059701
michd 0.0105810
age_category 0.0179520
urb_rural 0.0192074
race 0.0220851
blind 0.0362547
height 0.0510428
smoker 0.0532213
ecigs 0.0536574
heavy_drinking 0.0757518
binge_drinking 0.0770995
covid 0.0786803
obese 0.0935445
chol_check 0.1271200
chol_meds 0.1284284
physical_activity 0.1961147
income 0.1999040

Step 2: Chi-Square Tests for Missingness

Is missingness in one variable associated with the values of another variable?

These tests help determine whether values are missing completely at random (MCAR), or whether missingness depends on other variables (MAR, or possibly MNAR), so we can decide whether to include the affected variables in a regression model.

Test 1: Smoking vs. Asthma

Missingness of asthma_ever: 0.0039
Missingness of asthma_now: 0.856
Missingness of smoker: 0.0532

asthma_now has over 85% missingness, which does not bode well for any kind of model building or data analysis. However, we can test whether this missingness is related to the smoker variable, which itself has low missingness.
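The test below can be set up by flagging the rows where asthma_now is missing and crossing that indicator against the observed smoker categories (a sketch; column names are taken from the table above):

```r
# Indicator for whether asthma_now is missing in each row
data <- data |>
  mutate(missing_indicator = is.na(asthma_now))

# Chi-squared test of independence between the missingness
# indicator and the observed smoker categories
chisq.test(data$missing_indicator, data$smoker)
```

The same pattern, with a different variable inside `is.na()`, produces the education, income, and covid tests that follow.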

Smoking vs. Asthma Now:

## 
##  Pearson's Chi-squared test
## 
## data:  data$missing_indicator and data$smoker
## X-squared = 372.22, df = 3, p-value < 2.2e-16

We reject the null hypothesis, as the p-value (2.30e-80) is very small, indicating that missingness in asthma_now is associated with smoking status. The values are therefore not missing completely at random, and may well be MNAR; the results of the chi-square test show that using asthma_now in our regression is likely to bias our results.

Test 2: Education and Income vs. Urban/Rural

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data$missing_indicator and data$urb_rural
## X-squared = 31.129, df = 1, p-value = 2.414e-08

We reject the null hypothesis, as the p-value (2.41e-08) is very small, indicating that missingness in education is associated with urban/rural status. Both variables have low missingness independently, but the results of the chi-square test show that using them together in our regression could bias our results, or that a trend we see in one could be affected by the other.

Let’s test for income vs. urban/rural.

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data$missing_indicator and data$urb_rural
## X-squared = 43.056, df = 1, p-value = 5.319e-11

We reject the null hypothesis, as the p-value (5.32e-11) is very small, indicating that missingness in income is associated with urban/rural status. urb_rural has low missingness independently, and income sits just below 20% missingness, but the results of the chi-square test show that missingness in one could be related to the other. This, together with our EDA conclusions showing no significant relationship, led us to exclude both variables from the model.

Test 3: COVID-19 vs. Bronchitis

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data$missing_indicator and data$bronchitis
## X-squared = 45.436, df = 1, p-value = 1.577e-11

We reject the null hypothesis, as the p-value (1.58e-11) is very small, indicating that missingness in covid is associated with bronchitis status. Both variables have low missingness independently, so this result suggests the two could be strongly associated, and that a covid-bronchitis interaction is worth considering in our regression model.

Chi-Square Test: Covid vs. Bronchitis Interaction

Here we test the significance of this interaction. The p-value (3.73e-123) is very small, so we proceed with including this interaction in our regression model.
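One way this association can be checked is a chi-squared test on the observed covid and bronchitis responses themselves (a sketch; the original analysis may have constructed the contingency table differently):

```r
# Chi-squared test of association between the observed covid and
# bronchitis responses (complete cases only), as a check before
# adding the covid:bronchitis interaction term to the model
chisq.test(table(data$covid, data$bronchitis))
```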

Regression

Generalized Linear Models

We choose to run two GLM models: one for has_diabetes, whether someone has diabetes or not, and one for diab_type, the type of diabetes that someone has.

Variables Chosen and Rationale:

sex_at_birth: no missingness, widely known predictor of diabetes risk; for men and women at the same BMI, men have a higher risk.

age_category: age and type of diabetes tend to be related.

obese: has much lower missingness than physical_activity and a similar effect on T2D risk.

race: has been known to affect predisposition to diabetes.

good_health: whether being in generally good health mitigates the risk of diabetes.

kidney_disease: there is a bidirectional link between diabetes and kidney disease.

stroke: diabetes is associated with risk of cardiovascular complications such as stroke.

bronchitis: bronchitis is a comorbidity of T2D, and covid and bronchitis may interact.

covid: between diabetes and covid, risk of one may increase risk of the other.

Model 1: Has Diabetes vs. Does Not Have Diabetes

First, let us generate an initial model based on the predictive variables identified through our literature review. Through this model, we will assess which of the predictive variables actually have a statistically significant association with the has_diabetes variable in our dataset.

# Fit a GLM model
glm_model_db <- glm(
  has_diabetes ~ 
    sex_at_birth + age_category + race + obese + good_health +
    kidney_disease + stroke + covid * bronchitis,
  data = data,
  family = binomial(link = "logit") # Logistic regression
)
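The significant predictors listed below can be pulled from the fitted model with a pattern like the following (a sketch, assuming the broom package is available):

```r
library(broom)
library(dplyr)

# Tidy the fitted GLM, keep significant non-intercept terms,
# and sort by p-value (smallest first)
significant_terms <- tidy(glm_model_db) |>
  filter(term != "(Intercept)", p.value < 0.05) |>
  arrange(p.value) |>
  select(Variable = term, Coefficient = estimate, `P-value` = p.value)
```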

Based on the GLM above, we conclude that the following variables have a statistically significant association with the has_diabetes variable.

  • Age (especially any age category over 55)
  • Obesity
  • Overall health
  • Kidney disease
  • History of Stroke
  • Race (especially if White)
  • Sex
  • History of Bronchitis

Here is a table of these significant predictors, arranged by significance (smallest to largest p-value).

Variable Coefficient P-value
age_category55-59 2.659 0.00e+00
age_category60-64 2.820 0.00e+00
age_category65-69 2.990 0.00e+00
age_category70-74 3.077 0.00e+00
age_category75-79 3.135 0.00e+00
age_category80+ 2.936 0.00e+00
obeseYes -1.006 0.00e+00
good_healthYes -1.024 0.00e+00
kidney_diseaseYes 0.925 0.00e+00
age_category50-54 2.412 4.94e-277
age_category45-49 2.083 7.53e-200
age_category40-44 1.624 6.17e-117
strokeYes 0.432 1.17e-101
raceWhite -0.644 4.31e-70
age_category35-39 1.244 7.29e-65
sex_at_birthMale 0.157 2.10e-50
age_category30-34 0.766 1.87e-22
raceMultiracial -0.393 6.62e-15
raceOther -0.477 7.03e-12
bronchitisYes 0.110 5.29e-07
age_category25-29 0.321 2.74e-04
raceHispanic -0.125 1.72e-03
raceNative Hawaiian/Pacific Islander 0.166 3.00e-02

Model 2: Identifying Type 2 Diabetes

Let us now generate a model based on the same predictive variables as before. However, this model will assess which of the predictive variables have a statistically significant association with the diab_type variable in our dataset.

# Fit a GLM model
glm_model_dbtype = diabetes_df |> 
  filter(!is.na(diab_type)) |> 
  mutate(diab_type = case_match(diab_type, "Type 1" ~ 0, "Type 2" ~ 1)) |> 
  glm(diab_type ~ sex_at_birth + age_category + race + obese + good_health + 
        kidney_disease + stroke + covid * bronchitis,
      data = _,
      family = binomial(link = "logit") # Logistic regression
)

Based on the GLM above, we conclude that the following variables have a statistically significant association with the diab_type variable.

  • Obesity
  • Age (especially any age category over 30)
  • Race (especially if Hispanic, White, or Black)
  • Sex
  • History of Bronchitis
  • History of Covid
  • Interaction effect of Bronchitis and Covid History

Here is a table of these significant predictors, arranged alphabetically by term.

Variable Coefficient P-value
age_category30-34 1.163 6.43e-05
age_category35-39 1.682 2.11e-09
age_category40-44 1.692 2.14e-10
age_category45-49 2.189 9.18e-17
age_category50-54 2.341 1.01e-19
age_category55-59 2.455 4.42e-22
age_category60-64 2.894 5.70e-30
age_category65-69 2.850 1.22e-29
age_category70-74 3.195 5.74e-36
age_category75-79 3.304 2.12e-37
age_category80+ 3.355 5.24e-38
bronchitisYes 0.338 4.97e-03
covidYes 0.120 3.80e-02
covidYes:bronchitisYes -0.336 4.78e-02
obeseYes -0.948 1.40e-47
raceBlack -0.687 5.23e-03
raceHispanic -0.839 4.96e-04
raceWhite -0.683 3.50e-03
sex_at_birthMale -0.158 3.30e-03

Cross-Validation

First, let us generate test-training pairs from the diabetes data for cross-validation, using the crossv_mc function. This performs Monte Carlo cross-validation: we generate 100 random partitions of the dataset, each splitting it into a training subset for fitting and a held-out test subset for evaluation.

crossval_df = diabetes_df |> 
  filter(!is.na(diab_type)) |> 
  mutate(diab_type = case_match(diab_type, "Type 1" ~ 0, "Type 2" ~ 1)) |> 
  crossv_mc(100) |> 
  mutate(
    train = map(train, as_tibble),
    test = map(test, as_tibble)
  )

Next, let us finalize our two model specifications; we will then refit them to each training subset and extract the root-mean-squared error (RMSE) to assess the models' predictive accuracy.

# GLM model for predicting has_diabetes
db_model = glm(db_formula, data = diabetes_df, family = binomial(link = "logit"))

# GLM model for predicting diab_type
dbtype_model = diabetes_df |> 
  filter(!is.na(diab_type)) |> 
  mutate(diab_type = case_match(diab_type, "Type 1" ~ 0, "Type 2" ~ 1)) |> 
  glm(dbtype_formula,
      data = _,
      family = binomial(link = "logit"))

Constructing a dataframe with the RMSEs for each model.

# Construct dataframe with extracted RMSE
crossval_results = crossval_df |> 
  mutate(
    db_pred = map(train, \(x) glm(db_formula, data = x)),
    dbtype_pred = map(train, \(x) glm(dbtype_formula, data = x))
  ) |> 
  mutate(
    rmse_db = map2_dbl(db_pred, test, rmse),
    rmse_dbtype = map2_dbl(dbtype_pred, test, rmse)
  )

Finding the average RMSE for each model.

crossval_results |> 
  summarize(
    `RMSE for has_diabetes model` = format(mean(rmse_db), scientific = TRUE, digits = 3),
    `RMSE for diab_type model` = format(mean(rmse_dbtype), digits = 3)
  ) |> 
  knitr::kable()
RMSE for has_diabetes model RMSE for diab_type model
2.25e-13 0.277

Discussion

From the results of our cross-validation and error assessment, we can see that the second model, which predicts diab_type, appears to perform relatively well, especially considering the volume of missing data we were dealing with. This model had a mean RMSE below 0.3 across the 100 Monte Carlo splits.

On the other hand, we can see that there are likely some issues with the first model, which predicts has_diabetes. The mean RMSE appears to be very low, close to 0, which is highly unrealistic. It hints at deeper issues within the model, perhaps stemming from the unusually low p-values of significance shown by several of the predictor variables of that model. Unfortunately, due to limited timeframe and project scope, we were not able to pursue this issue further. In future exploration, we would dedicate more time to investigating, correcting, and refining our models.